Multiclass Text Classification

Dhairav Chhatbar

DATA 698

Text classification, or text categorization, is the activity of automatically assigning documents to their relevant predefined classes based on their content. This process of tagging a document with its appropriate category is typically manual and time intensive. While there have been many advances in keyword search, the limitation of searching is that it does not discriminate by context. This study looks at the industry problem of classifying documents into their respective categories so that the time spent opening, reviewing, and tagging a document for storage is minimized.

text1

Methodology

The generally established methodology of text classification is to understand the dataset, pre-process the data, establish a set of features, run a set of classifier models, and then evaluate the models. While there are numerous combinations and permutations of the exact strategy to employ, the choice will largely depend on the dataset, the available hardware, and the goals of the analysis.

steps

Data Exploration

The dataset that we will use to build text classification models is the 20newsgroups dataset, openly available at http://qwone.com/~jason/20Newsgroups/. The dataset contains approximately 18,800 newsgroup documents organized across 20 different categories, where each category represents a different topic of discussion; the categories are listed below. While there are 20 topics within the dataset, some topics are similar to one another, as shown in the table below. For example, we can expect similar points of discussion within the topics talk.religion.misc and alt.atheism, and similar discussions within the topics comp.sys.ibm.pc.hardware and comp.sys.mac.hardware.

Computers Recreational Science Politics Religion Misc
comp.graphics rec.autos sci.crypt talk.politics.misc talk.religion.misc misc.forsale
comp.os.ms-windows.misc rec.motorcycles sci.electronics talk.politics.guns alt.atheism
comp.sys.ibm.pc.hardware rec.sport.baseball sci.med talk.politics.mideast soc.religion.christian
comp.sys.mac.hardware rec.sport.hockey sci.space
comp.windows.x

An example of a post from the sci.space category can be seen below. Note that the formatting of the text does not restrict any characters: within our dataset the content of a document is not limited to letters (upper and lower case), numbers, or special characters.

For the given 18.8k documents, the distribution per topic is fairly even, with approximately 1,000 documents per topic except for talk.religion.misc, alt.atheism, talk.politics.misc, and talk.politics.guns, which have fewer.

Pre-processing

steps

As with non-language datasets, we will perform pre-processing and various cleaning methods prior to building the features of the dataset. There are numerous strategies and methods for pre-processing textual data, such as (but not limited to) lower casing, removal of stop words/frequent words/non-alphabetic words, stemming, lemmatization, spell correction, emoji conversion, and removal of URLs/HTML tags. The application of a certain method or set of methods (the pre-processing strategy) will largely depend on the nature of the text application at hand. In general, “cleaning” the corpus prior to running classifier models against it helps reduce noise by removing “bits” of text that would otherwise be irrelevant in the grand scheme of assigning a document to the correct topic.

Basic clean up items:

Unicode

Image source: BetterProgramming

Remove Stop Words

Stop words are a subset of conjunctions, prepositions, and/or particle words; examples are “as”, “a”, “is”, “then”, “the”, and many others. Such words lend very limited meaning in the context of document classification, as they are generally observed across all topics: stop words give a sentence grammatical structure but contain little information about its content. The method of stop word removal was to combine pre-defined word lists into a custom list. The pre-defined stop word sets came from the sklearn.feature_extraction and nltk.corpus packages, combined with 4 custom words ("don't", "like", "can't", "huh") not found in the pre-defined packages.
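A minimal sketch of building such a combined list. The sklearn set is importable directly; NLTK's list requires a one-time download of its stopwords corpus, so it is shown here as an optional merge:

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# custom words from the study not found in the pre-defined lists
custom_words = {"don't", "like", "can't", "huh"}

stop_words = set(ENGLISH_STOP_WORDS) | custom_words

# NLTK's list can be merged the same way (requires nltk.download("stopwords")):
# from nltk.corpus import stopwords
# stop_words |= set(stopwords.words("english"))

print(len(stop_words))  # size of the combined custom stop word list
```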

Lemmatization

In natural language a word will take different grammatical forms, but contextually each of the forms holds the same meaning. For example, the words “change”, “changing”, “changes”, and “changed” hold roughly the same meaning. Normalizing occurrences to a single word, “change”, would reduce redundant features. Generally, there are two approaches to this normalization: Stemming and Lemmatization.

Normalization via Stemming is the process of reducing words to their root and discarding the suffix. In the above example the root word is considered “chang” and the suffixes are “e”, “ed”, “ing”, and “s”. Normalization via Lemmatization reduces a word by use of its vocabulary and position in a sentence. The normalized word is referred to as a lemma and is often more understandable than the same stemmed word. For example, the words “change”, “changing”, “changes”, and “changed” will reduce to the lemma “change”, which is more interpretable than the stemmed word “chang”.

This pattern-based approach of stemming is fast and simple but is the equivalent of brute force. In comparison, Lemmatization is more resource costly but will generally produce more readable outcomes. Choosing Stemming over Lemmatization or vice versa has a precision/recall tradeoff: Stemming will increase recall at the expense of precision, while Lemmatization will increase recall without hurting precision.
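The suffix-stripping idea behind stemming can be illustrated with a deliberately naive pure-Python sketch; real stemmers such as NLTK's PorterStemmer apply a far more elaborate rule sequence:

```python
def simple_stem(word):
    # naive suffix stripping: drop the first matching suffix,
    # keeping at least a 3-character root
    for suffix in ("ing", "ed", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word

# all four forms collapse to the (non-word) root "chang"
for w in ["change", "changing", "changes", "changed"]:
    print(w, "->", simple_stem(w))
```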

[Figure: lemmatization example]

Image Credit: Importance of Text Pre-processing

Document Term Matrix

After the above pre-processing steps are done, our entire text corpus is turned into a Document Term Matrix (DTM), in which each token in the corpus is represented by a column of the matrix and each post is represented by a row. The columns are essentially the features of the dataset and each row an observation. The DTM holds the term frequency per document per term, such that:

[Figure: Document Term Matrix layout]

Feature Engineering

steps

In this step, for each document in D, the raw text will be transformed into a feature vector using the Term Frequency – Inverse Document Frequency (TF-IDF) vectorizer. TF-IDF aims to express the importance of specific words in a given document di compared to all words in the document collection D.

Term Frequency – Inverse Document Frequency vectorizer (TF-IDF)

[Figure: TF-IDF formula]
Image Credit: What and Why is TF-IDF Important for SEO

Once the IDF values are known, the tf-idf scores can be computed for any document or set of documents.

Classification Models

steps

The step after TF-IDF Vectorization is to build classification models to which the training data is supplied. The models implemented here are Support Vector Machines, Naïve Bayes, and Random Forests.

For each model, a set of parameters can be tuned to attain optimal results. We cycle through a set of hyper-parameters using GridSearchCV and RandomizedSearchCV from the sklearn package. GridSearch is an exhaustive method that cycles through every hyper-parameter combination and can be computationally expensive. Where feasible, RandomizedSearch has been used to tune hyper-parameters, as it has been shown to be more efficient than GridSearch. Each model has been built using cross validation on the training data to minimize overfitting.
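A minimal sketch of this tuning loop, assuming a tiny hypothetical corpus in place of the real 20newsgroups training data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# hypothetical mini-corpus standing in for the training data
docs = ["the shuttle orbit nasa", "nasa launch orbit",
        "puck goalie hockey", "hockey game goalie"]
labels = ["sci.space", "sci.space", "rec.sport.hockey", "rec.sport.hockey"]

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", MultinomialNB())])
params = {"clf__alpha": [0.1, 1.0]}        # hyper-parameter grid to cycle through

search = GridSearchCV(pipe, params, cv=2)  # 2-fold cross validation on this tiny sample
search.fit(docs, labels)
print(search.best_params_)
```

RandomizedSearchCV takes the same pipeline and parameter space but samples a fixed number of combinations (`n_iter`) instead of trying all of them.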

Support Vector Machine

The primary goal of Support Vector Machines (SVM) is to determine a decision boundary between classes with maximum separation. Between classes there can be an infinite number of decision boundaries. The points from each class closest to the decision boundary are called Support Vectors. SVM looks for the optimal decision boundary such that the Support Vectors of each class have a maximum margin from the decision boundary and each class lies on a different side of it. Classification of a new unseen point (X_i) is determined by which side of the decision boundary X_i resides on. Below is a simple binary classification with linearly separable classes and two features (X1 and X2).

[Figure: linear SVM decision boundary with support vectors]

For data that is not linearly separable, due either to overlapping classes or to the shape of the n-dimensional feature space for n features, SVM utilizes kernel functions (the “kernel trick”) to transform the feature space from n dimensions to a higher n+y dimensions to achieve better separation of classes. The separating boundary in this higher dimension is called a hyperplane.

[Figure: kernel transformation to a higher-dimensional space]
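A sketch of the two-feature, linearly separable case using scikit-learn's SVC with hypothetical points:

```python
import numpy as np
from sklearn.svm import SVC

# hypothetical linearly separable points: two features (X1, X2), two classes
X = np.array([[0, 0], [0, 1], [1, 0],
              [3, 3], [3, 4], [4, 3]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear")   # kernel="rbf" or "poly" for non-linear cases
clf.fit(X, y)

print(clf.support_vectors_)  # the points closest to the decision boundary
print(clf.predict([[0.5, 0.5], [3.5, 3.5]]))
```

A new point is classified purely by which side of the learned boundary it falls on.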

Naïve Bayes Classifier (Multinomial)

Naïve Bayes is a set of probabilistic algorithms that utilizes Bayes’ Theorem and probability theory to make predictions: the algorithm looks to determine the probability of an event A given that another event B has already occurred, P(A|B) = P(B|A) · P(A) / P(B).

Within the Naïve Bayes family of classifiers, the Multinomial Naïve Bayes classifier is most suited for text classification. This is because, while both algorithms function similarly, simple Naïve Bayes treats a post of a given class as a set of terms that are either present or not present, whereas the Multinomial Naïve Bayes algorithm considers term frequency.

[Figure: Bayes’ Theorem]
Image Credit: Naïve bayes Classifier From Scratch With Hands On Examples in R
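A sketch of Multinomial Naïve Bayes on term counts, with a tiny hypothetical corpus; note that the count features carry term frequency rather than mere presence:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# hypothetical posts from two of the 20 topics
docs = ["orbit shuttle orbit nasa launch",
        "nasa orbit launch moon",
        "goalie puck goalie hockey",
        "hockey game goalie puck"]
labels = ["sci.space", "sci.space", "rec.sport.hockey", "rec.sport.hockey"]

vec = CountVectorizer()
X = vec.fit_transform(docs)       # term frequencies, not binary presence
clf = MultinomialNB().fit(X, labels)

print(clf.predict(vec.transform(["nasa shuttle orbit"])))
```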

Random Forest

A Decision Tree is a type of supervised learning model in which the data is recursively split into two or more sub-populations given a criterion. Each split is headed by a node; the uppermost node is called the root node, and nodes which cannot be split further are called terminal (leaf) nodes. All other nodes are considered internal nodes. Based on a given population of observations, the population is split into sub-populations with the criterion that each split separates the sub-population better than the previous split. This recursive splitting cycles through the different predictor variables in order to find leaf nodes that make the purest class separation or most accurate predictions.

The Random Forest model is an ensemble method built from many individual decision trees, with the classification output determined by majority voting. The method aims to reduce the variance of individual trees and is particularly suited when predictors exhibit collinearity. Each tree in the forest is constructed using bootstrapped observations and a randomized subset of features. The forest-growing procedure is controlled by two tuning parameters that we seek to optimize: the number of trees in the forest and the number of random variables used to build each tree.

[Figure: Random Forest ensemble of decision trees]
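The two tuning parameters map directly onto scikit-learn's RandomForestClassifier arguments, sketched here on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic stand-in data: 200 observations, 10 features
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

clf = RandomForestClassifier(
    n_estimators=100,     # number of trees in the forest
    max_features="sqrt",  # random variables considered at each split
    random_state=0,
)
clf.fit(X, y)             # each tree sees a bootstrapped sample
print(len(clf.estimators_))  # the individual trees that vote
```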

Model Evaluation

steps

We ran 3 models on the same training and testing data. The below shows a summary of the training and testing metrics for each of the models.

[Figure: precision/recall summary of the models]

Precision is defined as the ratio of correctly predicted positive observations to the total predicted positive observations: of the documents predicted as a topic, how many were actually that topic. Recall is defined as the ratio of correctly predicted positive observations to all observations in the class: how good the model is at identifying a topic. The F1-Score is the harmonic mean of the Precision and Recall scores. For our dataset we will want to maximize the F1-Score, and while the Support Vector Machine and Naïve Bayes models have the same overall F1-Scores, we can look at the individual F1-Scores per topic.
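These definitions can be sketched from raw confusion counts (the TP/FP/FN values below are hypothetical, not taken from the study's results):

```python
def precision(tp, fp):
    # of those predicted as the topic, how many truly were
    return tp / (tp + fp)

def recall(tp, fn):
    # of the true members of the topic, how many were found
    return tp / (tp + fn)

def f1(p, r):
    # harmonic mean of precision and recall
    return 2 * p * r / (p + r)

# hypothetical counts for one topic
tp, fp, fn = 80, 20, 40
p, r = precision(tp, fp), recall(tp, fn)
print(p, r, f1(p, r))  # 0.8, ~0.667, ~0.727
```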

The below shows that the F1-Score is generally the same across all models, with some slight variation. All models performed poorly on talk.religion.misc and alt.atheism; talk.politics was another topic area that performed poorly.

The AUC-ROC curve shows that while Naïve Bayes performed poorly on one topic, it performed exceptionally well on the other topics, on par with SVM and Random Forest. The Random Forest and Support Vector models were computationally expensive, and the execution time of each (~40 mins and ~11 mins, respectively) shows this. The Naïve Bayes model, on the same testing and training datasets, executed in seconds, providing far less computational expense for similar accuracy results. Given its simplicity and time efficiency for this dataset and the given hardware, Naïve Bayes would be the model of choice.

Future Work

There are additional approaches and adjustments that can be made to see if accuracy and F1-Scores can be further increased, based on other studies referenced in this study. These additional areas are: